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Abstract 

We propose nested sequential Monte Carlo (NSMC), a methodology to sample from sequences 
of probability distributions, even where the random variables are high-dimensional. NSMC gen¬ 
eralises the SMC framework by requiring only approximate, properly weighted, samples from the 
SMC proposal distribution, while still resulting in a correct SMC algorithm. Furthermore, NSMC 
can in itself be used to produce such properly weighted samples. Consequently, one NSMC sam¬ 
pler can be used to construct an efficient high-dimensional proposal distribution for another NSMC 
sampler, and this nesting of the algorithm can be done to an arbitrary degree. This allows us to 
consider complex and high-dimensional models using SMC. We show results that motivate the 
efficacy of our approach on several hltering problems with dimensions in the order of 100 to 1 000. 
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Abstract 

We propose nested sequential Monte Carlo (NSMC), a methodology to sample from sequences 
of probability distributions, even where the random variables are high-dimensional. NSMC gen¬ 
eralises the SMC framework by requiring only approximate, properly weighted, samples from the 
SMC proposal distribution, while still resulting in a correct SMC algorithm. Furthermore, NSMC 
can in itself be used to produce such properly weighted samples. Consequently, one NSMC sam¬ 
pler can be used to construct an efficient high-dimensional proposal distribution for another NSMC 
sampler, and this nesting of the algorithm can be done to an arbitrary degree. This allows us to 
consider complex and high-dimensional models using SMC. We show results that motivate the 
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1. Introduction 


Inference in complex and high-dimensional statistical models is a very challenging problem that is 
ubiquitous in applications. Examples include, but are definitely not limited to, climate informatics 
(Monteleoni et al.H2013l, bioinformatics ( Cohen[ 20041 and machine learning ([Wainwright and 


Jordan 20081. In particular, we are interested in sequential Bayesian inference, which involves 


computing integrals of the form 




for some sequence of probability densities 


k > 1, 


( 1 ) 


( 2 ) 


with normalisation constants = f 7rfc(xi:fc)dxi:fc. Note that xi:k := (xi, ..., x^) € X^. The 
typical scenario that we consider is the well-known problem of inference in time series or state space 
models (Shumway and Stoffer 2011t Cappe et ahj 2005 1. Here the index k corresponds to time and 
we want to process some observations yi-k in a sequential manner to compute expectations with 
respect to the filtering distribution 7ffc(dxfc) = F{Xk G dx^ | yi-.k)- To be specific, we are interested 
in settings where 

(i) Xk is high-dimensional, i.e. Xk G with 1, and 
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Figure 1: Example of a spatio-temporal model where 'Kk{xi:k) is described by a fc x 2 x 3 undirected 
graphical model and G 


(ii) there are local dependencies among the latent variables Xi-k, both w.r.t. time k and between 
the individual components of the (high-dimensional) vectors X^. 


One example of the type of models we consider are the so-called spatio-temporal models ( Wikle[ 
2015 [Cressie and Wikle 2011 Rue and Held[ 20051. In Figure[T]we provide a probabilistic graph¬ 


ical model representation of a spatio-temporal model that we will explore further in Section]^ 
Sequential Monte Carlo (SMC) methods, reviewed in Section |2.1[ comprise one of the most 
successful methodologies for sequential Bayesian inference. However, SMC struggles in high- 


dimensions and these methods are rarely used for dimensions, say, d > 10 (Rebeschini and van 
Handel[ 20151. The purpose of the NSMC methodology is to push this limit well beyond d = 10. 
The basic strategy, described in Section |2.2[ is to mimic the behaviour of a so-called fully 


adapted SMC algorithm. Full adaptation can drastically improve the efficiency of SMC in high di¬ 
mensions. Unfortunately, it can rarely be implemented in practice since the fully adapted proposal 
distributions are typically intractable. NSMC addresses this difficulty by requiring only approxi¬ 
mate, properly weighted, samples from the proposal distribution. The proper weighting condition 
ensures the validity of NSMC, thus providing a generalisation of the family of SMC methods. Fur¬ 
thermore, NSMC will itself produce properly weighted samples. Consequently, it is possible to 
use one NSMC procedure within another to construct efficient high-dimensional proposal distribu¬ 
tions. This nesting of the algorithm can be done to an arbitrary degree. For instance, for the model 
depicted in Figure [T] we could use three nested samplers, one for each dimension of the “volume”. 

The main methodological development is concentrated to Sections |3]-^ We introduce the con¬ 
cept of proper weighting, approximations of the proposal distribution, and nesting of Monte Carlo 
algorithms. Throughout Section we consider simple importance sampling and in Section we 
extend the development to the sequential setting. 

We deliberately defer the discussion of the existing body of related work until Sectionj^ to open 
up for a better understanding of the relationships to the new developments presented in Sections 
1^ We also discuss various attractive features of NSMC that are of interest in high-dimensional 
settings, e.g. the fact that it is easy to distribute the computation, which results in improved memory 
efficiency and lower communication costs. Section|^profiles our method extensively with a state-of- 
the-art competing algorithm on several high-dimensional data sets. We also show the performance 


3 




































Naesseth, Lindsten and Schon 


of inference and the modularity of the method on a d = 1 056 dimensional climatological spatio- 
temporal model ( Fu et aL[ 2012| ) structured according to Figure[T] Finally, in Sectionj^we conclude 
the paper with some final remarks. 


2. Background and Inference Strategy 

2.1 Sequential Monte Carlo 


Evaluating 7ffc(/) as well as the normalisation constant in Q is typically intractable and we 
need to resort to approximations. SMC methods, or particle filters (PF), constitute a popular class 
of numerical approximations for sequential inference problems. Here we give a high-level intro¬ 
duction to the concepts underlying SMC methods, and postpone the details to Section]^ For a more 
extensive treatment we refer to Doucet and Johansen] (20111; Cappe et al.| ( |2005| l; poucet et al. 
([2001). In particular, we will use the auxiliary SMC method as proposed by [Pitt and Shephard 

At iteration k — 1, the SMC sampler approximates the target distribution ^k-i by a collection of 
weighted particles (samples) These samples define an empirical point-mass 

approximation of the target distribution 


" IF* 

vff_i(dxi:fc_i) := ^ (3) 

2 — ^ ^ /c 1 

where dx(dx) denotes a Dirac measure at X. Each iteration of the SMC algorithm can then con¬ 
ceptually be described by three steps, resampling, propagation, and weighting. 

The resampling step puts emphasis on the most promising particles by discarding the unlikely 
ones and duplicating the likely ones. The propagation and weighting steps essentially correspond to 
using importance sampling when changing the target distribution from itk-i to tt^, i.e. simulating 
new particles from a proposal distribution and then computing corresponding importance weights. 


2.2 Adapting the Proposal Distribution 


The first working SMC algorithm was the bootstrap PF by Gordon et al. ( 1993) , which propagates 
particles by sampling from the system dynamics and computes importance weights according to the 
observation likelihood (in the state space setting). However, it is well known that the bootstrap PF 


suffers from weight collapse in high-dimensional settings (Bickel et al. 20081, i.e. the estimate is 
dominated by a single particle with weight close to one. This is an effect of the mismatch between 
the importance sampling proposal and the target distribution, which typically gets more pronounced 
in high dimensions. 

More efficient proposals, partially alleviating the degeneracy issue for some models, can be de¬ 
signed by adapting the proposal distribution to the target distribution (see Section |4~2) . In Naesseth 
et al. I (2014aI we make use of the. fully adapted SMC method ( Pitt and Shephard] |1999| ) for doing 
inference in a (fairly) high-dimensional discrete model where Xk is a 60-dimensional discrete vec¬ 
tor. We can then make use of forward filtering and backward simulation, operating on the individual 
components of each Xk, in order to sample from the fully adapted SMC proposals. However, this 
method is limited to models where the latent space is either discrete or Gaussian and the optimal 
proposal can be identified with a tree-structured graphical model. Our development here can be 
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seen as a non-trivial extension of this technique. Instead of coupling one SMC sampler with an 
exact forward filter/backward simulator (which in fact reduces to an instance of standard SMC), 
we derive a way of coupling multiple SMC samplers and SMC-based backward simulators. This 
allows us to construct procedures for mimicking the efficient fully adapted proposals for arbitrary 
latent spaces and structures in high-dimensional models. 


3. Proper Weighting and Nested Importance Sampling 

In this section we will lay the groundwork for the derivation of the class of NSMC algorithms. 
We start by considering the simpler case of importance sampling (IS), which is a fundamental 
component of SMC, and introduce the key concepts that we make use of. In particular, we will use 
a (slightly nonstandard) presentation of an algorithm as an instance of a class, in the object-oriented 
sense, and show that these classes can be nested to an arbitrary degree. 


3.1 Exact Approximation of the Proposal Distribution 

Let 7f(x) = Z~^'k{x) be a target distribution of interest. IS can be used to estimate an expectation 
7f(/) := E^[/(A)] by sampling from a proposal distribution q{x) = Z~^q{x) and computing the 

estimator with VL* = and where {(A*, are the 

weighted samples. It is possible to replace the IS weight by a nonnegative unbiased estimate, and 
still obtain a valid (consistent, etc.) algorithm ( |Liu[|200T] p. 37). One way to motivate this approach 
is by considering the random weight to be an auxiliary variable and to extend the target distribution 
accordingly. Our development is in the same flavour, buf we will use a more explicif condifion on 
fhe relafionship befween fhe random weighfs and fhe simulated parficles. Specifically, we will make 
use of fhe following key properly fo formally justify fhe proposed algorilhms. 


Definition 1 (Properly weighted sample). A (random) pair {X,W) is properly weighted for an 
unnoTmalised distribution p if W > OandK[f(X)W] = p{f) ■= f f {x)p{x)dx for all measurable 
functions f. 


Note lhal proper weighting of {(X*, implies unbiasedness of fhe estimate of fhe nor- 


= f p(x)dx =: Zj 


p- 


malising conslanl of p. Indeed, faking f(x) = 1 gives E ^ kP* 

Infereslingly, lo conslrucf a valid IS algorifhm for our fargel vf if is sufficienl fo generafe samples 
lhal are properly weighted w.r.t. the proposal disfribulion q. To formalise fhis claim, assume fhal we 
are nol able lo simulale exaclly from q, buf lhal if is possible lo evaluate fhe unnormalised densily 
q poinl-wise. Furlhermore, assume we have access lo a class Q, which works as follows. The 
constructor of Q requires the specification of an unnormalised density function, say, q, which will 
be approximated by the procedures of Q. Furthermore, to highlight the fact that we will typically use 
IS (and SMC) to construct Q, the constructor also takes as an argument a precision parameter M, 
corresponding to the number of samples used by the “internal” Monte Carlo procedure. An object 
is then instantiated as q = Q(q, M). The class Q is assumed to have the following properties: 


(Al) Let q = Q(q, M). Assume that: 

1. The construction of q results in the generation of a (possibly random) member variable, ac¬ 
cessible as Zq = q.GetZ(). The variable Zq is a nonnegative, unbiased estimate of the nor¬ 
malising constant Zq = f q{x)dx. 
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2. Q has amember function Simulate whichretums a (possibly random) variable X = q.Simulate(), 
such that (X, Zq) is properly weighted for q. 

With the definition of Q in place, it is possible to generalise[^the basic importance sampler as in 
Algorithm[^ which generates weighted samples {(X*, targeting vf. Note that Algorithm[^ 

is different from a random weight IS, since it approximates the proposal distribution (and not just 
the importance weights). 


Algorithm 1 Nested IS (steps 1-3 for z = 1 , ..., N) 
I. Initialise q* = Q{q,M). 


2 . 

3. 

4. 


Set Zq = q*.GetZ() and X* = q®.Simulate(). 


Set = 


%(X0 

q(X^) 


Compute Z^ = ^ ZZi 


To see the validity of Algorithm [T] we can interpret the sampler as a standard IS algorithm for 
an extended target distribution, defined as n(x, u) := u Q(x, u)7r(x)q~^(x), where Q(x, u) is fhe 
joinf PDF of fhe random pair (q.Simulate(), q.GetZ()). Nofe fhaf H is indeed a PDF fhaf admifs vf 
as a marginal; for any measurable subsef A C X, 


n(A X 


Q{x,u)dxdu = E 

q{x) 




lA(X)7f(X) 


q{X) 


IT 


= q ( 1a ^ ) Zq = Tr{A), 


where fhe penulfimafe equably follows from fhe facl fhaf (X, Zg) is properly weighfed for q. Fur- 
fhermore, fhe sfandard unnormalised IS weighl for a sampler wifh largel H and proposal Q is given 
by ti 7r/q, in agreemenf wifh Algorilhm[^ 

Algorilhm[T]is an example of whal is referred fo as an exact approximation-, see e.g., Andrieu and 


Roberts] ([2009|l;|Andrieu ef ST (20101. Algorithmically, the method appears to be an approximation 


of an IS, but samples generated by the algorithm nevertheless target the correct distribution vr. 


3.2 Modularity of Nested IS 

To be able to implement Algorithm we need to define a class Q with the required properties 
(A[^. The modularity of the procedure (as well as its name) comes from the fact that we can use 
Algorithm[T]also in this respect. Indeed, let us now view if—the target distribution of Algorithm[T]— 
as the proposal distribution for another Nested IS procedure and consider the following definition 
of Q: 

1. Algorithm[^is executed at the construction of the object p = Q(7r, N), and p.GetZ() returns 
the normalising constant estimate Z.,^. 

2. p.Simulate() simulates a categorical random variable B with P(i? = i) = 
and returns X^. 

1. With q.GetZ() i —Z and q.Simulatef) returning a sample from q we obtain the standard IS method. 
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Now, for any measurable / we have, 

Z^7r(X^y 

q(X^) _ 

= Q 

where, again, we use the fact that (X^, is properly weighted for q. This implies that {X^, Z^) 
is properly weighted for tt and that our definition of Q('7r, N) indeed satisfies condition (A[^. 

The Nested IS algorithm in itself is unlikely to be of direct practical interest. However, in the 
next section we will, essentially, repeat the preceding derivation in the context of SMC to develop 
the NSMC method. 


N 


E[/(X®)Z,] = ^E 


i=l 




NZ^ 


1 ^ 

— Ve 

N ^ 


2=1 


/(X* 




4. Nested Sequential Monte Carlo 
4.1 Fully Adapted SMC Samplers 


Let us return to the sequential inference problem. As before, let ^k{xi-.k) = Z~^'Kk{xi-k) denote 
the target distribution at “time” k. The unnormalised density tt/j can be evaluated point-wise, but 
the normalising constant Ztt^. is typically unknown. We will use SMC to simulate sequentially 


from the distributions {'Kk}'k=i- particular, we consider the fully adapted SMC sampler (Pitt 


and ShephardI 1999| ), which corresponds to a specific choice of resampling weights and proposal 
distribution, chosen in such a way that the importance weights are all equal to \/N. Specifically, 
the proposal distribution (often referred to as the optimal proposal) is given by qk{xk \ xi-,k-i) = 
~^qk{xk I a;i:fc_i), where 


Zqj. 


Qk{xk I Xi'k—l) 


'^k(,Xi-k) 

'^k-lixi:k-l) 


In addition, the normalising “constant” Zg^, (xi-k-i) = f qk{xk \ xi-k-i)d.Xk is further used to define 
the resampling weights, i.e. the particles at time A: — 1 are resampled according to Zgj,(xi:fc_i) 
before they are propagated to time k. For notational simplicity, we use the convention xi-q = 0, 
qi{xi I xi:o) = vri(xi) and Zq^[xi-Q) = Z^^. The fully adapted auxiliary SMC sampler is given in 
Algorithm]^ 


As mentioned above, at each iteration A: = 1, ..., n, the method produces un weighted samples 


{Xl}fLi approximating vTfc. It also produces an unbiased estimate Z^^, of Z^^, (Del Moral 


2004 


Proposition 7.4.1). The algorithm is expressed in a slightly non-standard form; at iteration k we 
loop over the ancestor particles, i.e. the particles after resampling at iteration A: — 1, and let each 
ancestor particle j generate offsprings. (The variable L is just for bookkeeping.) This is done 
to clarify the connection with the NSMC procedure below. Furthermore, we have included a (com¬ 
pletely superfluous) resampling step at iteration A: = 1, where the “dummy variables” 
are resampled according to the (all equal) weights {Zg^(X|. q)}^^ = The analogue of 


this step is, however, used in the NSMC algorithm, where the initial normalising constant Zj^ is 
estimated. We thus have to resample the corresponding initial particle systems accordingly. 
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Algorithm 2 SMC (fully adapted) 

1. SetZ^g = 1. 

2. for A; = 1 to n 

(a) Compute x ^ 'l2j=i 

(b) Draw from a multinomial distribution with probabilities jv V —r) for j = 

(c) Set L ^ 0 

(d) for j = 1 to 

i. Draw ~ qk{- \ andletXJ.^ = X|,) fori = L+1, ..., L+m;^. 

ii. Set L L + ml.. 


4.2 Fully Adapted Nested SMC Samplers 

In analogue with Section assume now that we are not able to simulate exactly from q^, nor 
compute Zqj_. Instead, we have access to a class Q which satisfies condition (A[^. The proposed 
NSMC method is then given by Algorithmj^ 


Algorithm 3 Nested SMC (fully adapted) 

1. SetZ^o = 1. 

2. for A: = 1 to n 

(a) Initialise qf = Q{qk{- \ for j = 1, ..., N. 

(b) Set Zgj, = qf.GetZ() for j = 1, ..., N. 

(c) Compute Z^, = Z^,_^ x %} • 

Z^ 

(d) Draw from a multinomial distribution with probabilities for j = 

1, 

(e) Set L ^ 0 

(f) for y = 1 to 

i. Compute = qf .Simulate() and let = (X;J.^_j^, X^) fori = L+1, ..., L+ 
ml- 

ii. delete qf . 

hi. Set L L + ml- 
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Algorithm can be seen as an exact approximation of the fully adapted SMC sampler in Al¬ 
gorithm (In Appendix |A.l we provide a formulation of NSMC with arbitrary proposals and 
resampling weights.) We replace the exact computation of ^<ik and exact simulation from qk, by the 
approximate procedures available through Q. Despite this approximation, however, Algorithm]^ is 
a valid SMC method. This is formalised by the following theorem. 

Theorem 2. Assume that Q satisfies condition Then, under certain regularity conditions on 


the function / : i-A and for an asymptotic variance S^(/), both specified in Appendix A.1.2, 

we have 


where are generated by Algorithm^and 

Proof See Appendix |A.1.2[ 


D 


denotes convergence in distribution. 


Remark 3. The key point with Theorem^is that, under certain regularity conditions, the NSMC 
method converges at rate s/N even for a fixed (and finite) value of the precision parameter M. The 
asymptotic variance S^(/), however, will depend on the accuracy and properties of the approxi¬ 
mative procedures of Q. We leave it as future work to establish more informative results, relating 
the asymptotic variance of NSMC to that of the ideal, fully adapted SMC sampler. 


4.3 Backward Simulation and Modularity of NSMC 

As previously mentioned, the NSMC procedure is modular in the sense that we can make use of 
Algorithm]^ also to define the class Q. Thus, we now view vf^s the proposal distribution that we 
wish to approximately sample from using NSMC. Algorithm ^directly generates an estimate 
of the normalising constant of 7r„ (which indeed is unbiased, see Theorem [^. However, we also 
need to generate a sample A’i:n such that {Xi:n, -^^n) is properly weighted for 7r„. 

The simplest approach, akin to the Nested IS procedure described in Section is to draw 


uniformly on {1, ..., N} and return Xi:n = ^i n- This will indeed result in a valid definition 
of the Simulate procedure. However, this approach will suffer from the well known path degen¬ 
eracy of SMC samplers. In particular, since we call qASimulate() multiple times in Step 2(f)i of 
Algorithm]^ we risk to obtain (very) strongly correlated samples by this simple approach. 

It is possible to im prove the performance of the above procedure b y instead making use of 


a backward simulator dGodsill et 


"Zl ||2004l 


Lindsten and Schon 


20131 to simulate Xi-n. The 


backward simulator, given in Algorithm]^ is a type of smoothing algorithm; it makes use of the 
particles generated by a forward pass of Algorithm to simulate backward in “time” a trajectory 
Xi:n approximately distributed according to 7f„. 

Remark 4. Algorithni^assumes unweighted particles and can thus be used in conjunction with the 
fully adapted NSMC procedure of Algorithm^ If however, the forward filter is not fully adapted 
the weights need to be accounted for in the backward simulation; see Appendix A.1.3 


The modularity of NSMC is established by the following result. 
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Algorithm 4 Backward simulator (fully adapted) 

1. Draw Bn uniformly on {1, ..., N}. 

2. SetXn = X^". 

3. for A; = n — 1 to 1 

/ \ . jj rj ■ Jc ^ k -j- 1 : ^ . 1 T. T 

(a) Compute IV^ =-——^-for j = 1, ..., N. 

^kiXlk) 


(b) Draw Bk from a categorical distribution with probabilities 


wi 


wi 


forj = 1, 


(c) SetXfc,^ = {X^^,Xk+i-.n)- 


.,N. 


Definition 5. Let p = Q(7r„, N) be defined as follows: 

1. The constructor executes Algorithm^with target distribution iTn and with N particles, and 
p.GetZ() returns the estimate of the normalising constant 

2. p.Simulate() executes Algorithm^and returns Xi-n- 

Theorem 6. The class Q defined as in Definition^satisfies condition 


Proof See Appendix A. 1.3 


A direct, and important, consequence of Theorem]^ is that NSMC can be used as a component of 
powerful learning algorithms, such as the particle Markov chain Monte Carlo (PMCMC) method 
( Andrieu et al.[[20T0 i and many of the other methods discussed in Section]^ Since standard SMC 
is a special case of NSMC, Theorem [^implies proper weighting also of SMC. 


5. Practicalities and Related Work 

There has been much recent interest in using SMC within SMC in various ways. The SMC^ by 
Chopin et al.| (2013) and the recent method by Crisan and Miguez (20131 are sequential learning 
algorithms for state space models, where one SMC sampler for the parameters is coupled with an¬ 
other SMC sampler for the latent states. Johansen et al. ( 2012| l and Chen et al. ( 2011[ l address the 
state inference problem by splitting the state variable into different components and run coupled 
SMC samplers for these components. These methods differ substantially from NSMC; they solve 
different problems and the “internal” SMC sampler(s) is constructed in a different way (for approxi¬ 
mate marginalisation instead of for approximate simulation). Another related method is the random 


weights PF of Feamhead et al. (2010a i, requiring exact samples from q and where the importance 


weights are estimated using a nested Monte Carlo algorithm. 


The method most closely related to NSMC is the space-time particle filter (ST-PF) (Beskos 


et al.[ |2014a| ), which has been developed independently and in parallel with our work. The ST-PF 
is also designed for solving inference problems in high-dimensional models. It can be seen as a 
island PF ( Verge et'5I| 2015| l implementation of the method presented by [Naesseth et al. (2014b I. 
Specifically, for a spafio-femporal models fhey run an island PF over bofh spafial and femporal 
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dimensions. However, the ST-PF does not generate an approximation of the fully adapted SMC 
sampler. 

Another key distinction between NSMC and ST-PF is that in the latter each particle in the 
“outer” SMC sampler comprises a complete particle system from the “inner” SMC sampler. For 
NSMC, on the other hand, the particles will simply correspond to different hypotheses about the 
latent variables (as in standard SMC), regardless of how many samplers that are nested. This is 
a key feature of NSMC, since it implies that it is easily distributed over the particles. The main 
computational effort of Algorithmis the construction of and the calls to the Simulate 

procedure, which can be done independently for each particle. This leads to improved memory 
efficiency and lower communication costs. Furthermore, we have found (see Section|^ that NSMC 
can outperform ST-PF even when run on a single machine with matched computational costs. 

Another strength of NSMC methods are their relative ease of implementation, which we show 
in Section [63] We use the framework to sample from what is essentially a cubic grid Markov ran¬ 
dom field (MRF) model jusf by implemenfing fhree nesfed samplers, each wifh a largel disfribufion 
defined on a simple chain. 

There are also ofher SMC-based mefhods designed for high-dimensional problems, e.g., fhe 
block PF sfudied by Rebeschini and van Handel ( |2015 1, fhe locafion particle smoofher by Briggs 
et al. (20131 and fhe PF-based mefhods reviewed in Djuric and Bugallo ( 2013| |. However, these 
methods are all inconsistent, as they are based on various approximations that result in systematic 
errors. 

The previously mentioned PMCMC ( jAndrieu et al. 20101 is a related method, where SMC 
is used as a component of an MCMC algorithm. We make use of a very similar extended space 
approach to motivate the validity of our algorithm. Note that our proposed algorithm can be used as 
a component in PMCMC and most of the other algorithms mentioned above, which further increases 
the scope of models it can handle. 


6. Experimental Results 

We illustrate NSMC on three high-dimensional examples, both with real and synthetic data. We 


compare NSMC with standard (bootstrap) PF and the ST-PF of jBeskos et al.j ( |2014a| ) with equal 
computational budgets on a single machine (i.e., neglecting the fact that NSMC is more easily dis¬ 
tributed). These methods are, to the best of our knowledge, the only other available consistent online 
methods for full Bayesian inference in general se quenti al models. For more detailed explanations 
of the models and additional results, see Appendix |A.3|p[ 


6.1 Gaussian State Space Model 

We start by considering a high-dimensional Gaussian state space model, where we have access to 
the true solution through belief propagation. The latent variables and measurements {Xi-k, Xi-.k], 
with Yfc} = {Xfc Yfc are modeled by a d x A: lattice Gaussian MRF. The true data is 


simulated from a nearly identical state space model (see Appendix A.3.11. We run a 2-level NSMC 
sampler. The outer level is fully adapted, i.e. the proposal distribution is = p{xk \ x^-i, Uk), 
which thus constitute the target distribution for the inner level. To generate properly weighted 
samples from qk, we use a bootstrap PF operating on the d components of the vector Xk- Note that 


2. Code available at https : //github. com/can-cs/nestedsmc 
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Figure 2: Top: Median (over dimension) ESS Q and 15-85% percentiles (shaded region). Bottom: 

The ERS Q based on the resampling weights in the (outermost) particle filter. The results 
are based on 100 independent runs for the Gaussian MRE with dimension d. 


we only use bootstrap proposals where the actual sampling takes place, and that the conditional 
distribution p{xk \ Xk-i, Uk) is not explicitly used. 

We simulate data from this model for fe = 1,..., 100 for different values of d = dim(xfc) G 
{50,100, 200}. The exact filtering marginals are computed using belief propagation.We compare 
with both the ST-PE and standard (bootstrap) PE. 


The results are evaluated based on the effective sample size (ESS, see e.g. Eeamhead et al. 


(2010b l) defined as. 


ESS{xk,i) = E 




-1 


(4) 


where Xk,i denofe fhe mean esfimafes and pk,i and cr^ ^ denofe fhe frue mean and variance of 
Xk,i I Vi-.k obfained from belief propagafion. The expecfafion in Q is approximafed by averag¬ 
ing over 100 independenf runs of fhe involved algorifhms. The ESS reflecls fhe esfimafor accuracy, 
obvious by fhe definifion which is fighfly relafed fo fhe mean-squared-error. Infuifively fhe ESS 
corresponds fo fhe equivalenf number of i.i.d. samples needed for fhe same accuracy. 

We also consider fhe effective resample size (ERS, Kong el al. ( |1994 i), which is based on fhe 
resampling weighfs af fhe lop levels in fhe respective SMC algorifhms. 


ERS = 






Ik 


(5) 


The ERS is an eslimale of fhe effeclive number of unique parlicles (or particle systems in fhe case 
of ST-PE) available al each resampling slep. 

We use N = 500 and M = 2 • d for NSMC and malch fhe compulafional lime for ST-PE 
and boolsfrap PE. We reporf fhe resulls in Eigure|^ The boolslrap PE is omitted from d = 100, 
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200 due to its poor performance already for d = 50 (which is to be expected). Each dimension 
I = 1,..., d provides us with a value of the ESS, so we present the median (lines) and 15-85% 
percentiles (shaded regions) in the first row of Eigurej^ The ERS is displayed in the second row of 
Eigure]^ Note that ESS gives a better reflection of estimation accuracy than ERS. 

We have conducted additional experiments with different model parameters and different choices 
for N and M (some additional results are given in Appendix |A.3.1| ). Overall the results seem to be 
in agreement with the ones presented here, however ST-PE seems to be more robust to the trade-off 
between N and M. A rule-of-thumb for NSMC is to generally try to keep N as high as possible, 
while still maintaining a reasonably large ERS. 


6.2 Non-Gaussian State Space Model 

Next, we consider an example with a non-Gaussian SSM, borrowed from |Beskos et ah (2014ai 
where the full details of the model are given. The transition probability p{xk \ Xk-i) is a localised 
Gaussian mixture and the measurement probability p{yk | is t-distributed. The model dimension 
is d = 1 024. [Beskos et al. ( 2014a[ ) report improvements for ST-PE over both the bootstrap PE 
and the block PE by Rebeschini and van Handel ( |2015 1. We use N = M = 100 for both ST-PE 
and NSMC (the special structure of this model implies that there is no significant computational 
overhead from running backward sampling) and the bootstrap PE is given N = 10 000. In Eigurej^ 



NSMC 


1 10 20 30 40 50 60 70 80 90 100 

k 


ST-PF 

Bootstrap 


Eigure 3: Median ESS with 15 — 85% percentiles (shaded region) for the non-Gaussian SSM. 


we report the ESS (Q, estimated according to Carpenter et al. (19991. The ESS for the bootstrap PE 
is close to 0, for ST-PE around 1-2, and for NSMC slightly higher at 7-8. However, we note that all 
methods perform quite poorly on this model, and to obtain satisfactory results it would be necessary 
to use more particles. 


6.3 Spatio-Temporal Model - Drought Detection 

In this final example we study the problem of detecting droughts based on measured precipitation 
data ( jJones and Harris| |2013| ) for different locations on earth. We look at the situation in North 
America during the years 1901-1950 and the Sahel region in Africa during the years 1950-2000. 
These spatial regions and time frames were chosen since they include two of the most devastating 


droughts during the last century, the so-called Dust Bowl in the US during the 1930s (Schubert 
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et al. 20041 and the decades long drought in the Sahel region in Africa starting in the 1960s (Foley 


et al. 2003 1 Hoerling et aL[ 20061. We consider the spatio-temporal model defined by Fu et al. 


N 



Figure 4: Illustration of the three-level NSMC. 


(2012) and compare with the results therein. Each location in a region is modelled to be in either 


Fu 


mear 


a normal state 0 or in an abnormal state 1 (drought). Measurements are given by precipitation (in 
millimeters) for each location and year. At every time instance k our latent structure is described by 
a rectangular 2D grid Xk = {26^,1 j j=i, in essence this is the model showcased in Figure 

et al.| ( 20T2] l considers the problem of finding the maximum aposteriori configuration, using a 
programming relaxation. We will instead compute an approximation of the full posterior filtering 
distribution vffc (xk) = p{xk \ yi-.k)- The rectangular structure is used to instantiate an NSMC method 
that on the first level targets the full posterior filtering distribution. To sample from we run, on 
the second level, an NSMC procedure that operates on the “columns” j = 1, .. ., J. 

Finally, to sample each column X^^uj we run a third level of SMC, that operates on the individual 
components Xk^ij, i = I, ..., I, using a bootstrap proposal. The structure of our NSMC method 
applied to this particular problem is illustrated in Figure]^ 

Figurej^gives the results on the parts of North America that we consider. The first row shows the 
number of locations where the estimate of p{xk,i,j = 1) exceeds {0.5,0.7,0.9}, for both regions. 
These results seems to be in agreement with |Fu et al.| ( |2012[ Figures 3, 6). However, we also 
receive an approximation of the full posterior and can visualise uncertainty in our estimates, as 
illustrated by the three different levels of posterior probability for drought. In general, we obtain 
a rich sample diversity from the posterior distribution. However, for some problematic years the 
sampler degenerates, with the result that the three credibility levels all coincide. This is also visible 
in the second row of Figure]^ where we show the posterior estimates p{xk,ij \ yi-.k) for the years 
1939-1941, overlayed on the regions of interest. For year 1940 the sampler degenerates and only 
reports 0-1 probabilities for all sites. Naturally, one way to improve the estimates is to run the 
sampler with a larger number of particles, which has been kept very low in this proof-of-concept. 


7. Conclusions 

We have shown that a straightforward NSMC implementation with fairly few particles can attain 
reasonable approximations to the filtering problem for dimensions in the order of hundreds, or 
even thousands. This means that NSMC methods takes the SMC framework an important step 
closer to being viable for high-dimensional statistical inference problems. However, NSMC is not 
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Figure 5: Top: Number of locations with estimated p(x = 1) > {0.5, 0.7,0.9} for the two regions. 

Bottom: Estimate of p{xt^i = 1) for all sites over a span of 3 years. All results for 
N = 100, Ni = {30,40}, iVs = 20. 


a silver bullet for solving high-dimensional inference problems, and the approximation accuracy 
will be highly model dependent. Hence, much work remains to be done, for instance on combining 


NSMC with other techniques for high-dimensional inference such as localisation (Rebeschini and 


van Handel} 20151 and annealing (Beskos et al. 2014b I, in order to solve even more challenging 


problems. 
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Appendix A. Appendix 


In this appendix we start out in Section [AT] by providing a more general formulation of the NSMC 
method and proofs of the central limit and proper weighting theorems of the main manuscript. 
We also detail (Section A.2 1 a straightforward extension of nested IS to a sequential version. We 
show that a special case of this nested sequential IS turns out to be more or less equivalent to the 
importance sampling squared algorithm by Tran et al. ( |2013[ ). This relationship serves as evidence 
that illustrates that the NSMC framework being more widely applicable than the scope of problems 
considered in this article. Finally, in Section A.3 we give more details and results on the experiments 
considered in the main manuscript. 


A.l Nested Sequential Monte Carlo 

We start by presenting a general formulation of a nested auxiliary SMC sampler in Algorithm]^ In 
this formulation, qk{xk \ xi:k-i) is an arbitrary (unnormalised) proposal, normalised by 


l) 


qk{xk\xi-k-i)dxk. 


Furthermore, the resampling weights are obtain by multiplying the importance weights with the 
arbitrary adjustment multipliers Vk-i{xv.k-i-, Zq^), which may depend on both the state sequence 
xi-k-i and the normalising constant (estimate). The fully adapted NSMC sampler (Algorithm]^ in 
the main document) is obtained as a special case if we choose 


Qk{Xk I —l) 


T^k (.Xi’k) 
’^k—lixX'.k—l) 


and Zq^) = Zq^, in which case the importance weights are indeed given by Wj. = 1. 


A. 1.1 Nested SMC is SMC 


The validity of Algorithm can be established by interpreting the algorithm as a standard SMC 
procedure for a sequence of extended target distributions. If Zq^ is computed deterministically, 
proper weighting (i.e., unbiasedness) ensures that Zq^ = Zq^, and it is evident that the algorithm 
reduces to a standard SMC sampler. Hence, we consider the case when the normalising constant 


estimates Zq^ are random. 


For A; = 1, ..., n + 1, let us introduce the random variable Uk-i which encodes the complete 


xi:fc_i), M). Let the distribution of Uk-i be 
into a standard (auxiliary) SMC framework. 


internal state of the object q generated by q = Q(qfc 
denoted as | To put Algorithm 

we shall interpret steps [2alj2b] of Algorithmj^as being the last two steps carried out during iteration 
k — 1, rather than the first two steps carried out during iteration k. This does not alter the algorithm 
per se, but it results in that the resampling step is conducted first at each iteration, which is typically 
the case for standard auxiliary SMC formulations. 

The estimator of the normalising constant is computable from the internal state of q, so that we 
can introduce a function such that Zq^, = Tk{Uk-i)- Furthermore, note that the simulation of Xk 
via Xk = q.Simulate() is based solely on the internal state Uk-\, and denote by {xk \ Uk-i) the 
distribution of Xk- 
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Algorithm 5 Nested SMC (auxiliary SMC formulation) 

1. Set {Xq}^^ to arbitrary dummy variables. Set Wq = 1 for i = 1, ..., X. Set = 1. 

2. for A: = 1 to n 


(a) Initialise q-^' = Q{qk{- \ for j = 1, ..., N. 

(b) Compute = q-^ .GetZO for j = 1, ..., N. 

(c) Compute = Vk-iiXlh-i, %) for j = 1, ..., N. 


(d) Draw from a multinomial distribution with probabilities 




j = 1, 


N. 


2^£=1 ^k-r'^k-i 


for 


(e) Set L ^ 0 

(f) for y = 1 to X 

i. Compute 

ml- 

ii. Compute Wl 


qf .Simulate() and let X^.^ = (X;^.^_^,X^) fori = L+1, ..., L+ 


'^k-i{X(.^_^) i^l_iqk{Xl I X;J.^_^) 


for i = L + 1, ..., L + 


iii. delete qf. 

iv. Set L L + ml.. 


(g) Compute Z^j^ = Z^j^_ 




Lemma 7, Assume that Q satisfies condition in the main manuscript. Then, 

j Tk{Uk-l)A^ {Xk I Uk-l)'4)lf_i{Uk-l I Xl:fc-l)dttfc_l = qk{Xk I Xl,k-l). 

Proof The pair (X^, Tk{Uk-\)) are properly weighted for qk. Hence, for a measurable function /, 

E[/(Xfc)rfc((7fe_i) I xi,k-i\ = jj fixk)Tkiuk-i)^^(xk \ Uk-i)'fi^_iiuk-i \ xi..k-i)duk-idxk 

Since / is arbitrary, the result follows. ■ 


We can now define the sequence of (unnormalised) extended target distributions for the Nested 
SMC sampler as. 


Tk{Uk-l)'4>lf {Uk I Xi..k)'y^ {Xk I Uk-l) Ttkixi-.k) 




nfc(xi:fc, M0:fc) • — 


qk{.Xk I Xi-k—1 


^fc—1 1) 




and no(rto) = 'fiQ {uq). We write Qk = '^k x Ufc for the domain of flfc. 
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Lemma 8. Assume that Q satisfies condition in the main manuscript. Then, 

i^k I l)nfc_i,'U0:A;—l)duo:fc—1 — '^k—liA\-,k—l^Q.kixk \ Xl-.k—l')- 

Proof The proof follows by induction. At k = 1, we have f ri(tio) 7 ^(xi | uo)'fi^{uo)duo = 
qi{xi) by Lemma |7] Hence, assume that the hypothesis holds for > 1 and consider 

j Tk+liUk)AkiliXk+l I rifc)nfc(xi:fc,UO:fc)dUO:fc 

'^k — l{^l:k—l)Qk{^k \ ^l:k—l) 

■ j rfc+i(rifc)7^i(xfc+i I Uk)rk{uk-i)'fijf (uk \ xi-.k)^^{xk \ Uk-i)Uk-iixi:k-i,uo:k-i)duo:k 

_ T^kixv.k) (f Tk+l(Uk)j^^(Xk+l I Uk)'4’jf{Uk I Xl.,k)dUk) 

'^k—l(^^l:k—l)Qk{^k \ ^lik — l) 

■ j TkiUk-l)fi^ {Xk I Uk-l)'nk-l{xi:k-l,U 0 :k-l)dU 0 :k-l 

_ '^k{xi.k')Qk+lixk+l I X\.k')T^k—lixi:k—l)Qkixk \ Xl-,k—l) 

'^k—l{x\-,k—l)qk{xk I Xi-k—l) 




where the penultimate equality follows by applying Lemma and the induction hypothesis to the 
two integrals, respectively. ■ 


As a corollary to Lemma it follows that 


J" ^ki-^l'.kj tlQ-k)dUQ.k '^ki^l-.k)- 


( 6 ) 


Consequently, H^. is normalised by the same constant as tt^, and by defining t\.k{xi-k, uo-.k) ■= 
Zfi^Ilk{xi:k,uo:k) we obtain a probability distribution which admits itk as a marginal (note that 
Hq = Hq, which is normalised by construction). This implies that we can use ftfc as a proxy 
for vffc in a Monte Carlo algorithm, i.e., samples drawn from H^ can be used to compute ex¬ 
pectations w.r.t. TTk- This is precisely what Algorithm does; it is a standard auxiliary SMC 
sampler for the (unnormalised) target sequence H^, /c = 0, ..., n, with adjustment multiplier 
weights Uk-i{xi:k-i, Tk{uk-i)) and proposal distribution j^{xk \ Uk-i)'fi^{uk \ xi-k). The (stan¬ 
dard) weight function for this sampler is thus given by 


Wk{xv,k,UQ.,k) oc 


1 —1) 1 —1) 't~kitlk—l')')Qki^k \ —l) 


which is the same as the expression on line|2(f)ii]of Algorithm 


(V) 


A. 1.2 Central Limit Theorem - Proof of Theorem[2]in the Main Manuscript 

Now that we have established that Nested SMC is in fact a standard auxiliary SMC sampler, albeit 
on an extended state space, we can reuse existing convergence results from the SMC literature; see 
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e.g., Johansen and Doucet (2008i; Done and Moulines (20081; Done et al. (2009i; Chopin (2004i 


or the extensive textbook by Del Moral] ( |2004 1 . 

Here, in order to prove Theorem of the main manuscript, we make use of the result for the 
auxiliary SMC sampler by Johansen and Doucet (20081, which in turn is based on the central limit 
theorem by [Chopin (2004 1 . The technique used by Johansen and Doucet (20081 is to reinterpret (as 
detailed below) the auxiliary SMC sampler as a sequential importance sampling and resampling 
(SISR) particle filter, by introducing the modified (unnormalised) fargef disfribufion 


) Tfc+l('Ufc))nfc 


( 8 ) 


The auxiliary SMC sampler described in fhe previous section can fhen be viewed as a SISR algo- 
rifhm for (|^. Indeed, if we wrife Qf{xk,Uk \ xi:k-i,Uk-i) ■= ipjf {uk \ xi:k)^jf{xk \ Uk-i) for 
fhe joinf proposal disfribufion of (x^, u^), then the weight function for this SISR sampler is given 
by 


Wl,{xi.,k,uo.,k) 


_ ^'kixi-k,uo-.k) _ 

Qk tlk I , n0:fc—l) 


oc Uk{xi-k,Tk+l{Uk))Wk{xi:k,U0:k), 


(9) 


where H4 is defined in ([7]). This weighf expression fhus accounfs for bofh fhe importance weighfs 
and fhe adjusfmenf mulfipliers of fhe auxiliary SMC sampler formulafion. 

Since fhis SISR algorifhm does nof fargef Hfc (and fhus nof vffc) direcfly, we use an additional IS 
sfep fo compufe esfimafors of expeefafions w.r.f. fo n. The proposal disfribufion for fhis IS procedure 
is given by 


^0:fc) •— Qk '^k \ 1 ? 1 (3^1:/c—1; l) • 


( 10 ) 


Nofe fhaf we obfain an approximafion of ( [TOj ) affer fhe propagafion Sfep 2(f)i| of Algorifhm]^ buf 
before fhe weighting sfep. The resulting IS weighfs, for fargef disfribufion fik{xi-.k,uo-.k) and wifh 
proposal disfribufion (fTO]), are given by 


nfc(xi;fc, llQ-k) 
^k(xi:kj tt0:k) 


— • ^k(xi:ky ttQ-k) OC IT/c, U0:fc) ■ 


Hence, wifh / : i—)■ being a fesf funefion of inferesf we can esfimafe E^f^, [/] = Ejj^ [/] (wifh 

obvious abuse of nofafion) by fhe esfimafor 


N 

E 


=1 l^i=\ '^^k 


( 11 ) 


which, again, is in agreemenf wifh Algorifhm 

We have now reinferprefed fhe NSMC algorifhm; firsf as a sfandard auxiliary SMC sampler, 
and fhen furfher as a sfandard SISR mefhod. Consequenfly, we are now in fhe position of direcfly 
applying, e.g., fhe cenfral limif fheorem by Chopin ( 2004[ Theorem 1). The condifions and fhe 
sfafemenf of fhe fheorem are reproduced here for clarify. 
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For any measurable function / 
measurable function / : 0fc i— 

rM / f\ ^rM 


©0 let = VarTM(/) and define, for any 




V^if) = f^fc“i(EQM[/]) +En._jVarQM(/)], 

%^{f) = V,^if)+YaTndf), 


k > 0, 
k > 0, 
k> 0. 


Define recursively <l>fc fo be fhe sef of measurable funclions / : 0^. i—)• such fhaf fhere exisfs a 

(5 > 0 wifh Ep^ [||iy^/|p+‘^] < oo and such fhaf fhe function {xi:k-i,uo:k-i) E [W'J] is in 
Furthermore, assume fhaf fhe idenfify funcfion / = 1 belongs fo for oach k. Then, if 


follows by Chopin (2004 Theorem 1 and Lemma A.l) fhaf 


jVV2 


'' Af N 




( 12 ) 


for &ny function f such thnt the function i—^ ^qm [f - Ep [/]] is in 4>fc_i and there 

V/g k 

< oo. The convergence in ([72]) thus holds for the unweighted 


exists a (5 > 0 such that Ep 


2+5] 


samples obtained after the propagation Step |2(f)i] of Algorithm]^ but before the weighting step. 

To complete the proof, it remains to translate ([T2]) into a similar result for the IS estimator ([TT|). 


To this end we make use of Chopin (2004 Lemma A. 2) which is related to the IS correction step 
of the SMC algorithm. Specifically, for a funcfion / : i—)• lef : 0^ i—)• denofe fhe 

exfension of / fo 0^, defined by f'^{xi-k,UQ-k) = f{xi-k). Then, for any / : Xfc i—)• such 
fhaf fhe funcfion {xi-,k-i,uo-k-i) i—)• EQAf[a;fc/®] is in ^k-i and fhere exisfs a <5 > 0 such fhaf 

Eri,[+fc/^lP’'"'^] < oo, we have 


jVV2 




V(0,Ef{/)), 


i=l 2^1=1 rrfc 

where {(Xp^, IF^)}^^ are generafed by Algorifhmj^and Sf(/) = ^fc""+fc(r-EnJ/1)). 

A. 1.3 Nested SMC Generates Properly Weighted Samples - Proof of Theorem[6] 
IN THE Main Manuscript 

In fhe previous fwo seefions we showed fhaf fhe NSMC procedure is a valid inference algorifhm 
for 7f„. Nexf, we fum our affenfion fo fhe modularify of fhe mefhod and fhe validify of using fhe 
algorifhm as a componenf in anofher NSMC sampler. Lef us sfarf by sfafing a more general version 
of fhe backward simulafor in Algorilhm|^ Clearly, if fhe forward NSMC procedure is fully adapfed 
VF^ = 1, Algorifhm [^reduces fo fhe backward simulafor sfafed in fhe main manuscripf. 

We will now show fhaf fhe pair Xi:^,) generated by Algorifhms and is properly 
weighted for TTnixi-n), and fhereby prove Theoremj^in the main manuscript. 


The proof is based on the particle Markov chain Monte Carlo (PMCMC) construction (Andrieu 


et al. 20101. The idea used by Andrieu et al. (20101 was to construct an extended target distribution, 
incorporating all the random variables generated by an SMC sampler as auxiliary variables. This 
opened up for using SMC approximations within MCMC in a provably correct way; these seem¬ 
ingly approximate methods simply correspond to standard MCMC samplers for the (nonstandard) 
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Algorithm 6 Backward simulator 


1. Draw Bn from a categorical distribution with probabilities 




Ef=i 


for j = 1, ..., N. 




2. Set Xn = Xn 


3. for A: = n — 1 to 1 


(a) Compute Wl = for j = 1, ..., A^. 


^k{Xlk) 

(b) Draw from a categorical distribution with probabilities for j = 1, ..., N. 

(C) SetXk:n = {X^'°,Xk+l:n)- 

4. return Xi^n 


K 

Ell W‘ 


extended target distribution. Here we will use the same technique to prove the proper weighing 
property of the NSMC procedure. 


We start by introducing some additional notation for the auxiliary variables of the extended 


l:Ar 


target construction. While Algorithm is expressed using multinomial random variables 
in the resampling step, it is more convenient for the sake of the proof to explicitly introduce the 
ancestor indices see e.g., [Andrieu et al. (2010l. That is, A\ is a categorical random 


variable on {1, ..., N}, such that is ancestor particle at iteration A: — 1 of particle X^,. The 

resampling Stepof Algorithm]^ can then equivalently be expressed as: simulate independently 
from the categorical distribution with probabilities 




trr put 

Z^£=i 


Let Xfc = {Xl..., Xf}, Ufc = {Ul,..., C/f}, and A^ = ..., Af}, denote all the 

particles, internal states of the proposals, and ancestor indices, respectively, generated at iteration k 
of the NSMC algorithm. We can then write down the joint distribution of all the random variables 
generated in executing Algorithm(up to an irrelevant permutation of the particle indices) as, 

(N . 1 ” f ^ . i i 

4'NSMc(xi:n,U0:n,ai:„) = I ]JV’^(u*o) > ^ Qk I U 1) 

u=i J k=i [i=i Z^r=i 

(13) 

where we interpret and Wl as deterministic functions of 
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Let Bn denote a random variable defined on {1, ..., N}. The extended target distribution for 
PMCMC samplers corresponding to (fTS]) is then given by 


U0:n) 3-1:71) ^n) • — 


Z-n 




z. 




'I'NSMc(xi:n, U0:n, ai:„), 


(14) 


Andrieu et al. 

On 


(20101 


where is a deterministic function of (xi:„, uo:n, ai:„). We know from 
that l> is a probability distribution which admits tin as its marginal distribution for 
Consequently, by Q it follows that the marginal distribution of is 7f„. For later reference we 
define recursively b^-i '■= for A; = 1, ..., n, fhe particle indices for fhe frajecfory obfained by 
fracing backward fhe genealogy of fhe 5n’th particle af iferafion n. 

We now fum our affenfion fo fhe backward simulafor in Algorifhmj^ Backward simulation 
has indeed been used in fhe confexf of PMCMC, see e.g. |Whifeley] ( |2010| ; |Lindsfen and Schbii 
(20131; Lindsfen el al. (2014). The sfrafegy used for combining PMCMC wifh backward simulation 
is fo show fhaf each step of fhe backward sampler corresponds fo a partially collapsed Gibbs sam¬ 
pling step for fhe exlended large! disfribulion $. This implies lhal fhe backward sampler leaves 
invarianf. 

We use fhe same approach here, bul we need fo be careful in how we apply fhe exisling resulls, 
since fhe PMCMC disfribulion $ is defined w.r.l. fo tin, whereas fhe backward simulafor of Algo- 
rilhm|^ works wifh fhe original large! disfribulion itn- Neverlheless, from fhe proof of Lemma 1 by 
[Lindsfen ef al.| ( [2014 1 if follows fhaf we can wrile fhe following collapsed conditional disfribulion 
of $ as: 


I U0;fc_l, ai-fc, 

bf^ 

t“0:fe-l’-If TM/ I bk 




To simplify Ibis expression, consider, 

nn(xi:„,'U 0 :n) A ( Ts{Us-l)'li>f (Ug \ Xl:s)^f {Xg \ Us-l) TTg^Xl-.g) 




(15) 


^k{.X\-.k, ^0:fc) 


n 

s=fc-|-l 

‘4}^{Un I Xi-n) 
-ipt^iuhlxi-.k) 1 j- 


qg{Xg I Xl:s_l) 


—1 l) 


n 


Tg{Ug-l)'ll;^l{Ug-l\xi:g-l)'yf{Xs\Ug-l) 1 7r„(xi:n) 




s=k+l 


{xg I Xl:s-l) 


'^k i^l'.k') 

(16) 


By Lemma 1^ we know fhaf each faclor of fhe producl (in brackefs) on fhe second line inlegrales fo 
1 over Ug-i- Hence, plugging ( [T^ info ( [T5] ) and infegrafing over yields 

.T./!, I bk+i:„ ^ -ij^bk^^(i^l^k^^k+l:n}) 

\ Xl:fc, U0:fc_l, ai-fc, , 5fc+i.„) OC W,t- - , 

which coincides wifh fhe expression used fo simulale fhe index in Algorilhm Hence, sim- 
ulalion of Bk indeed corresponds fo a partially collapsed Gibbs sampling step for 4* and if will 


Ihus leave <f> invarianf. (Nole fhaf, in comparison wifh fhe PMCMC sampler derived by Lindsfen 
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et al. 


(|2014|l we further marginalise over the variables which, however, still results in a valid 


partially collapsed Gibbs step.) 

We now have all the components needed to prove proper weighting of the combined NSMC/backward 
simulation procedure. For notational simplicity, we write 

for the distribution of Bk in Algorithm]^ Let Xi-n) be generated by Algorithms]^ andLet 
/ be a measurable function and consider 

E[Z^J{X,.,n)] = I Z,,,/(xfJ-)|j]^BS,k(d6;)|^NSMc(d(xi^ 

n -1 'I 

^BS,k(d6fc) > 'l(d(xi:„,U0:n,ai;„,6'„)), 


- Ztt 




.k=l 


where, for the second equality, we have used the definition (pA]) and noted that 4'BS,n(^n) = 




Eti wi 


. However, by the invariance of 'I'BS.k w.r.t. <1>, it follows that 


E[Z^JiXi.,n)] = Z^„ j = Z^„7f„(/), 

which completes the proof. 


A.2 Nested Sequential Importance Sampling 


Here we give the definition of the nested sequential importance sampler and we show that a special 
case of this is the importance sampling squared (IS^) method by Tran et al. (2013 1 . 


A.2.1 Nested Sequential Importance Sampling 

We present a straightforward extension of the Nested IS class to a sequential IS version. Consider 
the following definition of the Nested SIS Q: 

1. Algorithm [ t] is executed at the construction of the object p = Q{Trn, N), and p.GetZ() returns 
the normalising constant estimate Z^^^. 

2. p.Simulate() simulates a categorical random variable B with = i) = IF^/ 
and returns X^^. 

Note that we do not require that the procedure Q is identical for each individual proposal q^, 
thus we have a flexibility in designing our algorithm as can be seen in the example in Section [A.2.2 
We can motivate the algorithm in the same way as for Nested IS and similar theoretical results hold, 
i.e. Nested SIS is properly weighted for 7r„ and it admits 7f„ as a marginal. 


A.2.2 Relation to IS" 


Here we will show how IS^, proposed by Tran et al. ( |2013 1, can be viewed as a special case of 
Nested SIS. We are interested in approximating the posterior distribution of parameters 9 given 
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Algorithm 7 Nested SIS {all ifor 1, ..., N) 


1. Initialise q* = Q(gi(-),M). 

2. Set = q\GetZ(), XI = q*.Simulate(). 

3. Set Wl = ■ 

4. delete q*. 

5. for k = 2 to n: 

(a) Initialise q* = Q(qk(- 1 

(b) Set = q*.GetZ(), = q*.Simulate(). 

(c) Set Wl = Wl 

(d) delete q*. 

(e) SetXl.,^{Xl.,_„Xl) 

6. Compute Z^„ = ^ELw^A- 


some observed values y 

TT{9\y) (xp{y\e)p{e). 

We assume that the data likelihood p{y \ 9) can, by introducing a latent variable x, be computed as 
an integral 

piy I I 


Now, let our target distribution in Nested SIS be Tr2{9, x) = 'K 2 {x \ d)T^i{6) = '^p(y\9) ^ ^^ P(9)- 

We set our proposal distributions to be 


qi{0) = 9is{d), 

-! |a^ p{y\x,o)p{x\e) 

p(m ■ 

First, Q(gi(-), 1) runs an exact sampler from the proposal g\^. Then at iteration A: = 2 we let the 
nested procedure Q(g 2 (-1 0*), M) be a standard IS algorithm with proposal h{x\y,9), giving us 
properly weighted samples for q 2 . Putting all this together gives us samples 0* distributed according 
to g\s{9) and weighted by 


FF2* oc 


p^Qi) p(2/|a:*,0*)p(rc*|0*)ME<;=i 
5is(^*) 


p{y I , 9 ’‘) p { x ^ I 6*) 
h ( x ^ I y , 9 ^) 


p{y I x% 6^)p{x'^ I 0* 


PMiy\d")pid" 

9is{9") 


(17) 
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p{y \ \ O ’-) 

\M. 


where Pat( y I 6**) = M 

identical to the IS^ algorithm proposed by|Tran et al.|(|2013 1. 




Thus we obtain a Nested SIS method that is 


A.3 Further Details on the Experiments 

We provide some further details and results for the experiments presented in the main manuscript. 


A.3.1 Gaussian State Space Model 

We generate data from a synthetic d-dimensional (dim(xfc) = d) dynamical/spatio-tempora|^model 
defined by 


Tfc I Tfc—1 ~ A/"(tA;, /ifc(Tfc—l)) S), 
Pfc I Xfc ~ Af{yk-,Xk,T^^I), 


where S and are given as follows 


/Tp + -T^ 

Ttp Tp + 2 t .0 


E = 


0 ••• 

-r^ 0 


0 


0 

V 0 


0 

0 


0 
0 

0 

0 

Ta 

Xip Tp 

T0 


0 


l) — (lTpYjXk—1- 


0 

0 

0 

0 

0 

T0 

Tn + r, 


'ip/ 


Alternatively, in a more standard state space model notation, we have 


Xk = Axk-i + Vk, Vk ~ AA(0, Q), 
yk = Xk + efc, efc ~ Af(0, R), 

where A = OTpS, Q = T. and R = We assume that the parameters 6 = {Tpj,a,Tp,Tp)) = 

(1,0.5,1,10) are known. 

To do inference with this generated data-set {yk} we propose to target the following slighdy 
different model 


k 

p{xi:k,yi-.k) OC Y { 4 >{ Xj , yj ) p { Xj )' lp { Xj , Xj - l ), 
1=1 


3. Note that in a previous version this was erraneously stated as equivivalent to the Gaussian MRF we use for sequential 
inference. Thus this example actually illustrates a problem where we have a misspecified model. However, this 
misspecification does not lead to any discernible difference in the MSE results. This because the exact filtering 
marginals for the two different models (LGSS, GMRF) with the parameters chosen differs with orders of magnitudes 
much lower than the Monte Carlo errors. 
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where the observation potential cj) and interaction potentials p and are given by 

d d 


(t}{xk,yk) = Y\4>i{xk,i,yk,i) = , 

1=1 i=i 

d ^ r 

i’ixk) = Y\'ipi{xk,i,Xk,i-i) = 

1=2 1=2 

d d 

p{xk,Xk-i) = '[\pi{xk,i,Xk-i,i) = 


1=1 


1=1 


This can be visualised as a Gaussian rectangular {d x k) lattice MRF, i.e. it grows with “time” k. The 
goal is to estimate the filtering distribution p{xk \ yi-.k)- Note that this model has almost identical 
filtering marginals as the data generating distribution and leads to a simpler implementation of 
NSMC and ST-PF. 

Results (mean-squared-error, MSB) comparing NSMC and ST-PF for different settings of N 
and M can be found in the first row of Figure and the second row displays the results when 
comparing ST-PF to the SMC method by [Naesseth et al. (2014b I for equal computational budgets. 
We show median (over dimensions d) MSE for posterior marginal mean and variance estimates 
of the respective algorithms. True values are obtained using belief propagation. Note that setting 
N = 1 in ST-PF can be viewed as a special case of the SMC method by Naesseth et al. (2014b I. 

A.3.2 Spatio-Temporal Model - Drought Detection 

We present the full model for drought detection in our notation, this is essentially the model by 


et al. (2012) adapted for estimating the filtering distribution. The latent variables for each location 


on a finite world grid, Xk,ij, are binary, i.e. 0 being normal state and 1 being the abnormal (drought) 
state. Measurements, yk,i,j, are available as real valued precipitation values in millimeters. The 
probabilistic model for filtering is given as. 


P{xi:k,yi:k) OC cl){Xn,yn)p{Xn)'lp{Xn,Xn-l), 


(18a) 


n=l 


where 


1 


4*{Xk-iyk) — IT IT ®^P 1 9 2 P'ah,i,jXk^iJ Pnorm,i,j{^ Xk^ij)) 

i=lj=l { 

I .1 

P{Xk) = IT IT {^1 } , 

i=lj=l 
I J 

'iii{xk,xk-i) =. 

i=ij=i 


(18b) 

(18c) 

(18d) 


Here, 1 is the indicator function, and with the convention that all expressions in ( |18c| ) that end up 
with index 0 evalute to 0. The parameters Ci, C 2 are set to 0.5, 3 as in (|Fu et al.[|2012|). Location 
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Median MSE for 


Median MSE for 'Var{xk,e) 



Nested SMC (N=1000, M=50) 
Nested SMC (N=667, M=75) 
Nested SMC (N=500, M=100) 
Nested SMC (N=100, M=500) 
ST-PF (N=100, M=1000) 
ST-PF (N=200, M=500) 

ST-PF (N=400, M=250) 


Median MSE for E[a: 


Median MSE for Var(a:fc^£) 



Figure 6: Top: Comparisons for different settings of N and M on the 50-dimensional SSM. Bot¬ 
tom: Illustrating the eonneetion between ST-PF and the SMC method by |Naesseth et al. 
([2014§. 


based parameters CTij, p,ah,i,j, Pnonn,ij are estimated based on data from the CRU dataset with world 
preeipitation data from years 1901 — 2012. For the North Ameriea region we eonsider a 20 x 30 
region with latitude 35 — 55°N and longitude 90 — 120°VF. For the Sahel region we eonsider a 
24 X 44 region with latitude 6 — 30° A" and longitude 10° VF — 35° E. Note that for a few locations in 
Africa (Sahel region) the average yearly precipitation was constant. For these locations we simply 
set Pnorm,i,j to be this value, Pab,i,j = 0 and afj to be the mean variance of all locations, thus this 
might have introduced some artifacts. Some representative results for the Sahel region are displayed 
in Figure]^ 
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